📈 RAG, Multi-Agent Graphs & Telemetry Cheat Sheet
A quick-reference guide for semantic chunking, parent-child re-ranking pipelines, LangGraph state machines, OpenTelemetry trace instrumentation, and LLM-as-a-Judge evaluation prompts.
🗂️ Advanced RAG & Retrieval
1. Semantic Chunking & Splits
Instead of splitting text at fixed character boundaries, start a new chunk wherever the cosine similarity between adjacent sentence embeddings drops below a threshold:
```python
# Semantic chunking: start a new chunk where adjacent-sentence similarity drops
import numpy as np

def split_semantically(sentences: list[str], embeddings: list[np.ndarray], threshold: float) -> list[str]:
    if not sentences:
        return []
    chunks = []
    current_chunk = []
    for i in range(len(sentences) - 1):
        current_chunk.append(sentences[i])
        # Cosine similarity between sentence i and sentence i+1
        similarity = np.dot(embeddings[i], embeddings[i + 1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1])
        )
        if similarity < threshold:  # Similarity dropped = new topic started
            chunks.append(" ".join(current_chunk))
            current_chunk = []
    # The loop stops one sentence short, so close the last chunk with the final sentence
    current_chunk.append(sentences[-1])
    chunks.append(" ".join(current_chunk))
    return chunks
```

🔁 2. Bi-Encoder vs. Cross-Encoder Retrieval
Combine a bi-encoder and a cross-encoder to get retrieval that is both fast and precise:
- Bi-Encoder: Embeds the query and each document independently into the same vector space. With an approximate nearest-neighbor index, search is very fast (roughly $O(\log N)$ for graph-based indexes; a brute-force scan is $O(N)$), but the query and document never attend to each other, so fine-grained relevance is lost.
- Cross-Encoder: Feeds the query and a retrieved document together through the self-attention layers of a transformer and scores the pair jointly. This is high-precision but slow, since every pair needs its own forward pass, which makes it best suited to re-ranking a small candidate set.
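The two stages above can be sketched as a retrieve-then-rerank function. This is a minimal illustration rather than a production pipeline: the brute-force cosine scan stands in for a vector index, and `retrieve_then_rerank`, `toy_cross_score`, the sample documents, and the 2-dimensional vectors are all hypothetical. In practice `cross_score` would wrap a real cross-encoder model.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_then_rerank(
    query: str,
    query_vec: Vector,
    docs: list[str],
    doc_vecs: list[Vector],
    cross_score: Callable[[str, str], float],
    k: int = 20,
    top_n: int = 3,
) -> list[str]:
    # Stage 1 (bi-encoder): rank documents by precomputed embedding similarity.
    # A brute-force scan stands in for a vector index here.
    by_similarity = sorted(
        range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
    )
    candidates = [docs[i] for i in by_similarity[:k]]
    # Stage 2 (cross-encoder): score each (query, candidate) pair jointly.
    reranked = sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)
    return reranked[:top_n]

# Hypothetical stand-in for a cross-encoder: fraction of query words found in the document.
def toy_cross_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

docs = [
    "refund policy for annual plans",
    "how to reset your password",
    "annual plans pricing table",
]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]  # Made-up embeddings
top = retrieve_then_rerank(
    "refund for annual plans", [1.0, 0.2], docs, doc_vecs, toy_cross_score, k=2, top_n=1
)
print(top)
```

Keeping `k` small is the point of the design: the expensive pairwise scorer runs only on the handful of candidates that the cheap first stage lets through.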
🧠 LangGraph Multi-Agent Topologies
Model agents as nodes in a state-machine graph. Below is a minimal template for a supervisor-worker topology:
```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

# 1. Define the shared state dictionary
class AgentState(TypedDict):
    task: str
    research_notes: str
    draft: str
    iterations: int
    next_step: str  # Routing key written by the supervisor; must be declared to be stored

# 2. Define node execution functions
def supervisor_node(state: AgentState):
    print("Supervisor evaluating task...")
    if not state.get("research_notes"):
        return {"next_step": "researcher"}
    return {"next_step": "writer"}

def researcher_node(state: AgentState):
    notes = "Found 3 corporate database records matching query."
    return {"research_notes": notes}

def writer_node(state: AgentState):
    draft = f"Draft report based on: {state['research_notes']}"
    return {"draft": draft}

# 3. Create the state graph builder
workflow = StateGraph(AgentState)

# 4. Add nodes to the graph
workflow.add_node("supervisor", supervisor_node)
workflow.add_node("researcher", researcher_node)
workflow.add_node("writer", writer_node)

# 5. Define edges and conditional routing rules
workflow.add_edge(START, "supervisor")
workflow.add_conditional_edges(
    "supervisor",
    lambda state: state.get("next_step"),  # Route on the key the supervisor wrote
    {
        "researcher": "researcher",
        "writer": "writer",
    },
)
workflow.add_edge("researcher", "supervisor")  # Cycle back to supervisor
workflow.add_edge("writer", END)

# 6. Compile the executable graph runtime
app = workflow.compile()

# 7. Run the graph (example input)
result = app.invoke({"task": "Summarize matching records"})
print(result["draft"])
```

📊 Distributed Telemetry & Tracing Spans
Instrument your code so that nested agent runs are captured as OpenTelemetry (OTel) spans:
```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# 1. Initialize the tracer provider
provider = TracerProvider()
# ConsoleSpanExporter prints spans to stdout; swap in a different exporter
# to send them to a tracing backend such as Langfuse
processor = SimpleSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-observability")

# 2. Instrument nested execution spans
def run_agentic_pipeline(user_query: str):
    with tracer.start_as_current_span("parent_agent_run") as parent_span:
        parent_span.set_attribute("query", user_query)

        # Nested span 1: database memory search
        with tracer.start_as_current_span("vector_memory_search") as db_span:
            db_span.set_attribute("vector_dimensions", 1536)
            time.sleep(0.5)  # Simulate database query latency
            db_span.add_event("Memories fetched successfully.")

        # Nested span 2: LLM inference call
        with tracer.start_as_current_span("llm_generation") as llm_span:
            llm_span.set_attribute("model_name", "gemini-1.5-flash")
            time.sleep(1.2)  # Simulate API latency
            llm_span.set_attribute("tokens_generated", 256)
```

🏆 LLM-as-a-Judge Evaluation Prompts
Run automated evaluation audits inside your CI/CD pipelines to quantitatively score agent behaviors:
1. Faithfulness Score (Detecting Hallucinations)
```text
SYSTEM: You are a strict quantitative audit judge.
Your task is to evaluate whether the GENERATED ANSWER is fully grounded in the provided CONTEXT.

CONTEXT:
{retrieved_context}

GENERATED ANSWER:
{agent_output}

Output a single JSON object containing:
- "verdict": "YES" if the answer contains only facts directly supported by the context, otherwise "NO".
- "hallucinated_sentences": A list of strings containing sentences in the answer that are not supported.
- "faithfulness_score": A float rating from 0.0 (fully hallucinated) to 1.0 (fully grounded).
```

2. Answer Relevancy Score (Directness Check)
```text
SYSTEM: You are a strict semantic evaluator.
Your task is to grade whether the GENERATED ANSWER directly addresses the USER QUERY.

USER QUERY:
{user_query}

GENERATED ANSWER:
{agent_output}

Output a single JSON object containing:
- "relevancy_score": A float rating from 0.0 (completely off-topic) to 1.0 (perfectly addresses query).
- "missing_information": A list of points requested by the query but omitted in the answer.
```
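To use these prompts as a CI gate, the judge's JSON reply has to be parsed and checked against a threshold. The sketch below shows one way to do that, under stated assumptions: `parse_judge_output` and `passes_gate` are hypothetical helper names, the 0.8 minimum is an arbitrary example value, and the judge reply is hard-coded in place of a real model call.

```python
import json

def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and validate any score field it contains."""
    # Judges sometimes wrap the JSON in extra prose, so keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("judge reply contains no JSON object")
    result = json.loads(raw[start : end + 1])
    for key in ("faithfulness_score", "relevancy_score"):
        if key in result:
            score = float(result[key])
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{key} out of range: {score}")
            result[key] = score
    return result

def passes_gate(result: dict, key: str, minimum: float) -> bool:
    """Return True when the named score meets the CI threshold."""
    return result[key] >= minimum

# Hypothetical judge reply, hard-coded here in place of a real model call.
reply = (
    'Here is my verdict: {"verdict": "NO", '
    '"hallucinated_sentences": ["The API was released in 2019."], '
    '"faithfulness_score": 0.6}'
)
parsed = parse_judge_output(reply)
gate_ok = passes_gate(parsed, "faithfulness_score", minimum=0.8)
print(parsed["verdict"], gate_ok)
```

In a pipeline, a failing gate would typically fail the build and surface `hallucinated_sentences` or `missing_information` in the job log so the regression can be traced to specific outputs.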